LegalTech for Consumer Protection
  • Project
  • Software
    Google PersonaBot Facebook Ads Data Analysis Dutch Products Crawler Facebook PersonaBot
  • Legal
    License Privacy Policy
  • Search
  1. Software
  2. Facebook Ads Data Analysis

Facebook Ads Data Analysis¶


In the following notebook, it is explored methodologies and limitations of using the Facebook Graph API to access the Facebook Ads library and perform data analysis on the 2021 Dutch General Elections and get key insights of advertisement practices

drawing

The notebook is divided in 3 main sections:

  • 1. Accessing the Facebook API as General user and Marketers
  • 2. Accessing the Facebook API as Developer
  • 3. Discovering Key Insights Analysing Ads Data

The first section discusses how to use and access Facebook ads data from a general user perspective via the Facebook Ads Library service, it is explored doing a generic search. The second section shows how to connect directly to the Graph API in developer mode and gather advertisement data directly, and thus perform data analysis. Finally, the third section presents a pilot data analysis exploring some key insights that could be useful for Consumer Protection Investigations or Compliance on Competition Law practices

Key Insights Summary¶

In this pilot data analysis, advertisement data has been collected on more than 8k ads using Facebook's Ad Library API. It is been searched the API by all advertisers and only include the period range around the Dutch General Elections in 2021, i.e. January to April 2021, all campaigns collected were inactive ones. We have observed that the average time of ads impressions is 4 days to the consumers, particularly, the 65+ were the most microtargeted group with ads impressions for almost a month. Interestingly, there were certain ads that microtargeted very specific groups reaching a huge audience (more than a 1M impressions) in a short amount of time (less than a day), these came from Jonge Socialisten in de PvdA, Wopke Hoekstra (CDA) and Woonbond, targeting 18-24 female, 18-24 male and 65+ male respectively. The party CDA was by far the one with more unique ads created with more than 2k, followed by smaller parties like Volt with 409 and DENK with 329. Furthermore, CDA spent up to 150,000 Euros on campaigns it wasn't the one that spent the most. Forum voor Democratie (FVD) spent up to 220,000 Euros in ads campaigns. The least microtargeted regions were Friesland and Zeeland, meaning that there were no campaigns that were intended exclusively for those regions. Marketers choose the last Friday before the election days (March 12) to launch the majority of the campaigns (3-4 days prior to voting), the ads were targeted generally in all provinces except Zeeland, Groningen and Friesland. In terms of topics, there was no special patterns found, only the mentioning of lockdown and crisis plan was new. Additionally using text similarity metrics (Levinstein distance) between the links and the pages we did not find doubtful links that would link to dubious websites. Note that some advertisers were mistakenly classified as political by Facebook's algorithms because their ad texts may contain words associated with political issues like "crisis", "environment" or "freedom".

Data acquisition and analysis were done using the Python programming language. Plotly for visualization and word frequencies by using the Natural Language Toolkit. The ads data and code used for the analysis can be found in the Github repo of this work. Any comments, please contact us at p.hernandezserrano@maastrichtuniversity.nl


Date: Jun 2021
Author: Pedro V Hernández Serrano
License: Attribution 4.0 International (CC BY 4.0)


1. Accessing the Facebook API as General user and Marketers¶


The Facebook Ads Library is an open repository of all the ads and campaigns, active and inactive from many countries that Facebook runs. This repository is exposed as a service with a search engine interface that will help users to easily look up for ads, campaigns, Facebook pages, etc. in the Facebook ads database, this service is naturally connected to the Facebook Graph API. The search interface has two filters Country and Ad Type.

Limitations:

  • The service is limited for the user to select one country at a time
  • The service only allows to look up All ads at once or the Issues, Elections or Politics category. Unfortunately Facebook is not making public the rest of their defined categories, this should be followed up by data-related regulation conversations.
  • The service shows the details of each ad, BUT actually only show the ad identifier, the link to the page, and the ad content, ONLY the Issues, Elections or Politics ads contain information about the demographics of the audiences, the rest of the ads do NOT contain demographics

Interface example:

drawing

Once the search is defined one can browse all the ads related to the keywords entered. Additional filters will appear. Active/Inactive, Advertiser, Platform and Impressions by date.

drawing

One can see the details of each ad, but important to note that the details only show the ad identifier, the link to the page, and the ad content, any information about the demographics of the audiences that this ad was shown NOT presented by Facebook

drawing

Using the API is impossible to do the last query since Facebook is only making public the parameter POLITICAL_AND_ISSUE_ADS therefore the rest of the ads are not accessible via the API

drawing

The following EU technical in ads transparency report documents the use of Facebook Graph API and how was used for analysing general elections ads https://adtransparency.mozilla.org/eu/methods/

There are a number of uses by accessing the Facebook Ads Library service, normally, marketers will get inspiration or impact in different campaigns worldwide. But also a general user can look back to certain ads that have been seen in the past to get details about the products that were offered. The database is huge, and in principle, having a suspicious Facebook Page id or a particular keyword combination one could aim to collect evidence of unfair or illegal practices in advertising

2. Accessing the Facebook API as Developer¶


The Facebook Graph API is the primary way for apps to read and write to the Facebook social graph. The official documentation is found in APIs and SDKs docs here. The Graph API has many uses, from creating and publish a game to analyze friends networks, naturally contains the Ads that Facebook publishes but only a limited amount of those, only limited to social issues and politics. In order to access and use the API, you need to gain access to the Facebook Ads Library API at https://www.facebook.com/ID and confirm your identity for running (or analysing) Ads About Social Issues, Elections or Politics, which involves receiving a letter with a code at your official account and sending picture identification to Facebook. Basically one has to be registered as an official Facebook developer, and this permission can actually take from one day to weeks.

The Facebook API has also a nice user interface (for developers) called the Graph API Explorer, which allows the developer or analyst to quickly generate access tokens, get code samples of the queries to run, or generate debug information to include in support requests. Here more info.

Requirements

  1. Register as a developer at developers.facebook.com
  2. Go to Graph API explorer and create an app
  3. Having a new app ID. Create a Token for the new app in the UI
  4. Define the Graph API Node to use: ads_archive

An example query that can be retrieved from the Graph API Explorer is the following

        ads_archive?access_token=[TOKEN]
        &ad_type=POLITICAL_AND_ISSUE_ADS
        &ad_active_status=ALL
        &fields=ad_creation_time%2Cad_creative_body2Cpage_name%2Cdemographic_distribution
        &limit=100
        &ad_reached_countries=NL
        &search_terms=.

There are of course a number of clients that can perform API calls, in the following pilot data analysis we are using a Python implementation

Interface example:

3. Discovering Key Insights Analysing Ads Data¶


The following section is divided into Data Collection and Descriptive Statistics in order to understand better the data, finally, we are going to briefly discuss the insights.

Facebook Ads - Data Collection¶


Max Woolf's facebook-ad-library-scraper it's the best out-the-box solution (as of early 2021) to to retrive ads data from a Python client, since it requires minimal dependencies.

  • Usage: Configure the script via config.yaml. Go to https://developers.facebook.com/tools/explorer/ to get a User Access Token, and fill it in with your token (it normally expires after a few hours but you can extend it to a 2 month token via the Access Token Debugger). Change other parameters as necessary.

  • Run the scraper script: In order to install an run the only requirements we use the following:

!pip3 install requests tqdm plotly

!python fb_ad_lib_scraper.py

  • Outputs: This script outputs three CSV files in an ideal format to be analyzed.

fb_ads.csv: The raw ads and their metadata.
fb_ads_demos.csv: The unnested demographic distributions of people reached by ads, which can be mapped to fb_ads.csv via the ad_id field.
fb_ads_regions.csv: The unnested region distributions of people reached by ads, which can be mapped to fb_ads.csv via the ad_id field.

The following notebook extracts over 8000 inactive ads by querying "stem" filtering therefore the ones related to the lections and setting a manageable limit for the API.

  • The data: The fields include all details about the ads, like creation time, link, description, and caption. Moreover, each ad is associated with a funding entity and a source page, this source page is normally the marketer or the original Facebook page associated with the ad, finally, there is information on the amount spend, and demographics like gender and regions.

      Age groups:  
        - 18-24
        - 25-34
        - 35-44
        - 45-54
        - 55-64
        - 65+
    
      Gender groups:  
        - male
        - female
        - unknown
    
      Regions:  
        - Noord-Holland
        - Zuid-Holland
        - Gelderland
        - North Brabant
        - Utrecht
        - Groningen
        - Overijssel
        - Flevoland
        - Limburg
        - Friesland
        - Zeeland

Extracting and reading the data¶

In [1]:
!python fb_ad_lib_scraper.py
 29%|██████████▉                           | 8601/30000 [01:26<03:34, 99.56it/s]Traceback (most recent call last):
  File "fb_ad_lib_scraper.py", line 61, in <module>
    for demo in ad['demographic_distribution']:
KeyError: 'demographic_distribution'
 29%|██████████▉                           | 8621/30000 [01:26<03:34, 99.78it/s]
In [2]:
import pandas as pd

df_demographics = pd.read_csv('data/fb_ads_demos.csv')
df_regions = pd.read_csv('data/fb_ads_regions.csv')
df_ads = pd.read_csv('data/fb_ads.csv')

Facebook Ads - Descriptive Statistics¶


The following section is focused on the statistical methodologies for describing the key insights of the Facebook Ads data, it is important to note that no hypothesis testing is performed in the current pilot data analysis, meaning that no correlation analysis or causal effects are studied. The purpose of this section is to understand the key insights of the data and therefore ask questions about ads practices.

Number of unique ads¶

The number of unique ads considered for this pilot data analysis is:

In [3]:
len(df_ads.ad_id.unique())
Out[3]:
8621

Ads Impressions Period¶

The ads run following the configuration set up that was made by the campaign creator, normally the conditions are that the campaign will run until the funds are exhausted, and there is a declared min and max per ad. Following this logic, some ads can run for hours, but some others for days, here we find the average, max, min and count of each ad

In [4]:
# Converting to datetime
df_ads[['ad_delivery_start_time', 'ad_delivery_stop_time']] =\
    df_ads[['ad_delivery_start_time', 'ad_delivery_stop_time']]\
    .apply(pd.to_datetime)
# Ads delivery period
df_ads['ad_delivery_start_time'].describe(datetime_is_numeric=True)
Out[4]:
count                             8619
mean     2021-01-02 23:17:33.811346944
min                2019-09-24 00:00:00
25%                2021-01-21 00:00:00
50%                2021-03-10 00:00:00
75%                2021-03-14 00:00:00
max                2021-07-16 00:00:00
Name: ad_delivery_start_time, dtype: object

The period of the dataset that was extracted from the API is 2019-09, to 2021-07. It is taken this period of time, since one can't specify the period in the API call, the date filter has to be done at the data level. For this analysis, we are taking 2021-01-01 to 2021-05-01.

In [5]:
# Filtering the period and adjusting
df_ads = df_ads\
    .query("ad_delivery_start_time >= '2021-01-01' & ad_delivery_start_time < '2021-05-01'")
#unique campaigns 
#df_ads.drop_duplicates(subset=['ad_creative_body'], keep='first')

# Creating a new feature, ads time: how long the ads were displayed in days
df_ads['ads_time'] = df_ads['ad_delivery_stop_time'] - df_ads['ad_delivery_start_time']
# Treat them as integer is useful also
df_ads['ads_days_time'] = pd.to_numeric(df_ads['ads_time'].dt.days, downcast='integer')

# Ads general stats
df_ads['ads_time'].describe()
Out[5]:
count                         6485
mean     4 days 07:32:59.028527370
std      5 days 08:53:42.355982903
min                0 days 00:00:00
25%                1 days 00:00:00
50%                3 days 00:00:00
75%                5 days 00:00:00
max               43 days 00:00:00
Name: ads_time, dtype: object

The campaigns are online in average 4 days and 7 hours, with a standard deviation of 5 days, the max campaign duration is 43 days.

In [7]:
df_ads.sort_values(by='ads_time',ascending=False).head(5)
Out[7]:
ad_id page_id page_name ad_creative_body ad_creative_link_caption ad_creative_link_description ad_creative_link_title ad_delivery_start_time ad_delivery_stop_time funding_entity impressions_min spend_min spend_max ad_url currency ads_time ads_days_time
6448 161990739027528 1550088745275913 DENK DENK wil dat ondernemers meer steun krijgen ti... bewegingdenk.nl CoronapandemieSteun DENK Steunen van het MKB D... Wij willen ondernemers steunen! 2021-01-21 2021-03-05 DENK 15000 0 99 https://www.facebook.com/ads/library/?id=16199... EUR 43 days 43
6453 867048690534134 1550088745275913 DENK DENK wil dat ondernemers meer steun krijgen ti... www.bewegingdenk.nl NaN Wij willen ondernemers steunen! 2021-01-21 2021-03-05 DENK 3000 0 99 https://www.facebook.com/ads/library/?id=86704... EUR 43 days 43
6447 440021530742598 1550088745275913 DENK DENK wil dat ondernemers meer steun krijgen ti... bewegingdenk.nl CoronapandemieSteun DENK Steunen van het MKB D... Wij willen ondernemers steunen! 2021-01-21 2021-03-05 DENK 20000 0 99 https://www.facebook.com/ads/library/?id=44002... EUR 43 days 43
6456 484767479583459 1550088745275913 DENK DENK wil dat ondernemers meer steun krijgen ti... www.bewegingdenk.nl NaN Wij willen ondernemers steunen! 2021-01-21 2021-03-05 DENK 10000 0 99 https://www.facebook.com/ads/library/?id=48476... EUR 43 days 43
6455 443594963659555 1550088745275913 DENK DENK wil dat ondernemers meer steun krijgen ti... www.bewegingdenk.nl NaN Wij willen ondernemers steunen! 2021-01-21 2021-03-05 DENK 2000 0 99 https://www.facebook.com/ads/library/?id=44359... EUR 43 days 43

The longest campaign was the following ad from the DENK party, mentioning "coronacrisis". This ad was online for 43 days.

drawing

Here the actual URL to the archive:
https://www.facebook.com/ads/library/?active_status=all&ad_type=political_and_issue_ads&country=NL&q=161990739027528&sort_data[direction]=desc&sort_data[mode]=relevancy_monthly_grouped&search_type=keyword_unordered&media_type=all

Authenticity of Links¶

As we could see, the ads would normally link you to the party page. However, some mischievous advertisers and marketers sometimes would use the ads to promote certain links or websites, this website would link to a scam page or sometimes viruses. One way to check the authenticity of the links is to actually compare the link URL to the name of the original page. Following this approach if the page name is D66 then the associated link is d66.nl or similar, the same for not partisan websites like KiesKlimaat and its link kiesklimaat.nl. Obviously, there would be a number of false positives, nevertheless is a good indication of dubious links.

It is used the Levenshtein distance to measure the similarity between the link and the page name, the closer to 0 the similar the names.

In [9]:
import Levenshtein as lev
# Get the page name and the link to compare the authenticity
df = df_ads[['ad_creative_link_caption','page_name']].dropna()
df = df[df.ad_creative_link_caption.str.contains(".")]
distances = []
for index, row in df.iterrows():
    distances.append((row['page_name'], row['ad_creative_link_caption'], lev.distance(row['page_name'], row['ad_creative_link_caption'])))

# The closer the distance the more alike the link is to the advertiser name
df_distances = \
    pd.DataFrame(distances)\
    .rename(columns={0:'page_name',1:'link',2:'levenshtein_distance'})\
    .drop_duplicates(subset=['link'], keep='first')\
    .sort_values('levenshtein_distance',ascending=True)\
    .reset_index(drop=True)
In [38]:
df_distances.tail(10)
Out[38]:
page_name link levenshtein_distance
266 Atria, voor gendergelijkheid en vrouwengeschie... stemgendergelijkheid.nl 33
267 International Campaign for Tibet Europe savetibet.nl 33
268 Feniks, Emancipatie Expertise Centrum Tilburg fenikstilburg.nl 35
269 Samen Kerk in Nederland - SKIN Verkiezingsdebat: internationale kerken & de p... 40
270 Mijn Stem www.energievannoordoosttwente.nl/wup/weerselo 41
271 CDA https://www.cda.nl/standpunten/krimpregio 41
272 Mijn Stem www.energievannoordoosttwente.nl/wup/manderveen 42
273 TivoliVredenburg Livestream: TivoliVredenburg viert Vrouwendag ... 43
274 Rectoraat San Thomas d'Aquino Actiegroep; lijn 2 Holtenbroek per direct teru... 54
275 Multicultureel Jongeren Geluid Het grote (online) Schilderswijkdebat onder le... 57

There are no immediate insights of dubious links in the Ads before the Dutch general elections. However, a way to verify this is to export this list and manually fact-checking authenticity.

The following example shows how a link is similar to the page name and therefore is having a distance closer to 0.

In [41]:
df_distances.head(5)
Out[41]:
page_name link levenshtein_distance
0 D66 d66.nl 4
1 BIJ1 BIJ1.org 4
2 Inwonersbelangen inwonersbelangen.nl 4
3 Dierenbescherming dierenbescherming.nl 4
4 ifaw ifaw.org 4

Amount of Ads Rate Change¶

What are the peaks and rises on the ads shown in the last weeks before the election? It would be interesting to note that some advertisers and pages suddenly invested the whole budget way too early, leaving fewer impressions towards the end.

In [11]:
# Group the advertisers by month
feb_march =\
    df_ads.groupby(['page_name', df_ads.ad_delivery_start_time.dt.to_period('M').astype(str)])\
    .size()\
    .loc[slice(None), slice('2021-02', '2021-03')]\
    .unstack()
# Top 10 rate of change between the two months
feb_march = feb_march\
    .assign(change = (feb_march['2021-03'] - feb_march['2021-02']) / feb_march['2021-03'])\
    .sort_values('change')
feb_march.head(10)
Out[11]:
ad_delivery_start_time 2021-02 2021-03 change
page_name
ActionAid Nederland 28.0 3.0 -8.333333
VluchtelingenWerk Nederland 29.0 4.0 -6.250000
ROSE stories 7.0 1.0 -6.000000
NLBeter 32.0 5.0 -5.400000
Nederland KansRijk 26.0 5.0 -4.200000
D66 Medemblik 4.0 1.0 -3.000000
Dierenbescherming 104.0 39.0 -1.666667
BN DeStem 5.0 2.0 -1.500000
Milieudefensie 33.0 16.0 -1.062500
Vereniging Basisinkomen 2.0 1.0 -1.000000
In [12]:
feb_march.tail(10)
Out[12]:
ad_delivery_start_time 2021-02 2021-03 change
page_name
Wieke Paulusma - kandidaat Tweede Kamerlid D66 1.0 NaN NaN
Windmolens en zonneweide in Baarle Nassau NaN 1.0 NaN
Woonbond NaN 36.0 NaN
Wopke Hoekstra NaN 933.0 NaN
World Animal Protection Nederland NaN 6.0 NaN
Wybren van Haga NaN 4.0 NaN
Wytske Postma CDA NaN 1.0 NaN
de Bergse VVD NaN 5.0 NaN
terug naar de Bijbel NaN 1.0 NaN
السوريين في هولندا ,مشاكل وحلول NaN 1.0 NaN

There are some advertisers that are obviously non political, we won't include them. Facebook mislabel them since inlcudes keywords related to social issues. It also appears that Wopke Hoekstra went "all in" towards the end.

Number of Ads by Advertiser¶

Who is actually the page/party that is putting more ads on Facebook?
To answer this, we simply make a count of unique ads by each one of the pages, it is interesting to note that certain campaigns of the same party run in parallel.

In [13]:
# ploting library
import plotly.express as px
In [14]:
# Aggregation and counting
df_ads_count = df_ads\
    .groupby('page_name')\
    .count()['ad_id']\
    .reset_index()\
    .sort_values(by='ad_id',ascending=False)\
    .rename(columns={'ad_id':'Ads Count', 'page_name':'Advertiser'})\
    .head(20)
In [15]:
px.bar(df_ads_count.sort_values(by='Ads Count',ascending=True),
       x='Ads Count', y='Advertiser', labels=None,
       orientation='h', color='Ads Count', color_continuous_scale='blues',title='Number of Ads by Advertiser')

The party CDA was by far the one with more unique ads created with more than 2k (counting CDA together with the personal Wopke Hekstra page), followed by smaller parties like Volt with 409 and DENK with 329

Total Spent by Advertiser¶

Who is actually the big spender in this game? The ads details contain the max and min budget by ad impression that each ad is budgeted, we have calculated the median of those points and accumulate the amount in Euros per page/party.

In [16]:
# For calculation
import numpy as np
# Calculate median of impression and ad spend ranges to get a more realistic estimate
df_ads['spend_median'] = df_ads[['spend_min', 'spend_max']].apply(np.median, axis = 1)
#df_ads['impressions_median'] = df_ads[['impressions_min', 'impressions_max']].apply(np.median, axis = 1)
In [17]:
# Aggregation and counting
df_ads_spend = df_ads.query("currency == 'EUR'")\
    .groupby('page_name')\
    .sum()['spend_median']\
    .reset_index()\
    .sort_values(by='spend_median',ascending=False)\
    .rename(columns={'spend_median':'EUR Spent', 'page_name':'Advertiser'})\
    .head(20)
In [18]:
px.bar(df_ads_spend.sort_values(by='EUR Spent',ascending=True), x='EUR Spent', y='Advertiser', labels=None,
       orientation='h', color='EUR Spent', color_continuous_scale='blues',title='Number of Ads by Advertiser')

Even though FvD created only 94 unique ads, they were the big spenders, potentially meaning that they actually put less effort on the creation and simply more budget per ad. This could only mean that they have a more consistant message (at least from the ads).

Demographic Groups Distribution¶

The Facebook Ads Library does not include any information of the actual people that were exposed to the ads, this would be an actual problem for Facebook nowadays. Instead, the API provides aggregated statistics of the demographics groups that were displayed by each ad. For example, the ad number 785007302430176 was displayed 60% to men and 40% to women, as well as 80% to 65+ and 20%, to '18-24' group. We could therefore ask different questions such as. Which groups are longer exposed?

In [19]:
# Maximum exposure time of advertisement per group
df_demographics = \
    df_demographics\
    .merge(df_ads[['ad_id','ads_time','ads_days_time']], on='ad_id', how='left')

df_demographics\
    .groupby(['age','gender'])\
    .describe()['ads_time']\
    .reset_index()\
    .sort_values(by='mean',ascending=False)\
    .head(10)
Out[19]:
age gender count mean std min 25% 50% 75% max
0 13-17 female 237 7 days 10:25:49.367088607 6 days 22:39:35.312600885 0 days 00:00:00 2 days 00:00:00 4 days 00:00:00 12 days 00:00:00 23 days 00:00:00
1 13-17 male 239 7 days 09:08:17.071129707 6 days 23:11:37.949375119 0 days 00:00:00 2 days 00:00:00 4 days 00:00:00 12 days 00:00:00 23 days 00:00:00
2 13-17 unknown 434 7 days 01:59:26.820276497 7 days 03:45:03.555578597 0 days 00:00:00 2 days 00:00:00 4 days 00:00:00 11 days 00:00:00 43 days 00:00:00
21 Unknown unknown 733 6 days 14:59:45.266030013 6 days 22:36:43.674256390 0 days 00:00:00 2 days 00:00:00 4 days 00:00:00 10 days 00:00:00 43 days 00:00:00
4 18-24 male 6485 4 days 07:32:59.028527370 5 days 08:53:42.355982903 0 days 00:00:00 1 days 00:00:00 3 days 00:00:00 5 days 00:00:00 43 days 00:00:00
5 18-24 unknown 6485 4 days 07:32:59.028527370 5 days 08:53:42.355982903 0 days 00:00:00 1 days 00:00:00 3 days 00:00:00 5 days 00:00:00 43 days 00:00:00
20 65+ unknown 6485 4 days 07:32:59.028527370 5 days 08:53:42.355982903 0 days 00:00:00 1 days 00:00:00 3 days 00:00:00 5 days 00:00:00 43 days 00:00:00
19 65+ male 6485 4 days 07:32:59.028527370 5 days 08:53:42.355982903 0 days 00:00:00 1 days 00:00:00 3 days 00:00:00 5 days 00:00:00 43 days 00:00:00
18 65+ female 6485 4 days 07:32:59.028527370 5 days 08:53:42.355982903 0 days 00:00:00 1 days 00:00:00 3 days 00:00:00 5 days 00:00:00 43 days 00:00:00
17 55-64 unknown 6485 4 days 07:32:59.028527370 5 days 08:53:42.355982903 0 days 00:00:00 1 days 00:00:00 3 days 00:00:00 5 days 00:00:00 43 days 00:00:00

Distributions by Group¶

Moreover, we could explore which groups tend to be more microtargeted. Meaning that a campaign creator would only focus on a particular combination of demographical attributes. For instance "Female 18-24 group".

In [20]:
# Creating a cross-table between gender and age groups per percentage of impresions
pivoted_demographics = df_demographics\
    .query("age != 'All (Automated App Ads)' & age != 'Unknown' & gender != 'All (Automated App Ads)' & percentage > 0.3")\
    .pivot_table(values='percentage', index=['gender','ad_id'], columns=['age'], aggfunc='max')\
    .reset_index()\
    .drop(columns=['ad_id'])

pivoted_demographics['gender_code'] = pivoted_demographics['gender']
pivoted_demographics\
    .gender_code\
    .update(pivoted_demographics.gender_code.map({'unknown':0,'male':1,'female':2}))
In [21]:
fig = px.parallel_coordinates(pivoted_demographics, 
                             color="gender_code",
                              color_continuous_scale=[(0.00, "white"),   (0.33, "white"),
                                                     (0.33, "red"), (0.66, "red"), #male
                                                     (0.66, "teal"),  (1.00, "teal")]) #female

fig.update_layout(coloraxis_colorbar=dict(
    title="Percentage by Grpup",
    tickvals=[1,2,3],
    ticktext=["Male","Female","Unknown"],
    lenmode="pixels", len=150,
))
fig.show()

The only outstanding patterns are that there is a higher tendency to micro-target females, we can see that in each peak of the age group, that the female group received 100% of the ads.

In [22]:
px.scatter(df_demographics.query("age != 'All (Automated App Ads)' & age != 'Unknown' & gender != 'All (Automated App Ads)'"),
           x="age", y="ads_days_time", color="percentage", color_continuous_scale='RdBu',
           category_orders={"age": ["13-17", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"]},
           labels=dict(age="Age Group", ads_days_time="Ads Impression in Days", percentage="Percentage"),
           size='percentage', hover_data=['gender'],title='Ads Exposure by Age Group')

No matter the age group, normally the campaigns would run proportionally, the youngest group 13-17 still received some ads, even thought they don't vote and the 65+ are the highest microtargeted

Microtargeted Ads¶

Given the above observations, we could explore further the more clever marketing tricks, this microtargeting that can reach audiences in the most optimal way, like for instance targeting one specific demographic group. To gather those we get only the ads are only displayed by region, gender and age quickly exhausting the budget.

In [23]:
ads_one_group = df_demographics\
    .query("age !='All (Automated App Ads)' and percentage > 0.9" )\
    .dropna()\
    .sort_values('ads_days_time', ascending = True)
ads_one_group.head(20)
Out[23]:
ad_id age gender percentage ads_time ads_days_time
96306 430691768045335 35-44 male 1.000000 0 days 0.0
104194 768477930746491 18-24 female 1.000000 0 days 0.0
96414 961250341291492 65+ male 1.000000 0 days 0.0
38470 2672202179737865 18-24 male 1.000000 0 days 0.0
38920 801162894083104 25-34 female 1.000000 0 days 0.0
38956 3693120407474936 55-64 male 1.000000 0 days 0.0
39244 785007302430176 55-64 female 1.000000 0 days 0.0
86874 813444159242671 25-34 female 1.000000 0 days 0.0
88144 249667446793436 25-34 female 1.000000 0 days 0.0
100660 268098691427331 35-44 female 0.923077 0 days 0.0
88638 335310441243417 18-24 male 1.000000 0 days 0.0
89169 124698196244795 18-24 female 1.000000 0 days 0.0
90617 427837775138690 65+ female 1.000000 0 days 0.0
96342 1254455388284614 65+ male 1.000000 0 days 0.0
96378 129305299032805 45-54 female 1.000000 0 days 0.0
96396 898997374229403 25-34 female 1.000000 0 days 0.0
27211 999569177234876 35-44 female 1.000000 0 days 0.0
27193 466503074394392 45-54 female 1.000000 0 days 0.0
35176 819086475347103 18-24 female 1.000000 0 days 0.0
27067 320354246113048 25-34 female 1.000000 0 days 0.0

Top 3 ads that were specifically targeted to demographic groups in a quick period of time exhausting the budget reaching maximum audience.

drawing

Here the actual URLs to the archive:

  • https://www.facebook.com/ads/library/?active_status=all&ad_type=political_and_issue_ads&country=NL&q=768477930746491&sort_data[direction]=desc&sort_data[mode]=relevancy_monthly_grouped&search_type=keyword_unordered&media_type=all

  • https://www.facebook.com/ads/library/?active_status=all&ad_type=political_and_issue_ads&country=NL&q=961250341291492&sort_data[direction]=desc&sort_data[mode]=relevancy_monthly_grouped&search_type=keyword_unordered&media_type=all

  • https://www.facebook.com/ads/library/?active_status=all&ad_type=political_and_issue_ads&country=NL&q=2672202179737865&sort_data[direction]=desc&sort_data[mode]=relevancy_monthly_grouped&search_type=keyword_unordered&media_type=all

Top ad by Region¶

Similarly to the demographic groups, one can aggregate the ads impressions by Dutch province.

In [26]:
#df_regions[df_regions['ad_id'] == 321114872792279].sort_values('percentage',ascending = False)
ads_one_region = df_regions\
    .query("region !='All (Automated App Ads)' & percentage > 0.8" )\
    .groupby('region')\
    .count()['ad_id']\
    .reset_index()\
    .sort_values('ad_id')\
    .merge(df_regions.groupby('region').count()['ad_id'].reset_index(), on='region', how='left')\
    .rename(columns={'ad_id_x':'Targeted Ads', 'ad_id_y':'Total Ads'})
ads_one_region['Relative %'] =  ads_one_region['Targeted Ads']/ads_one_region['Total Ads']*100
ads_one_region.head(10)
Out[26]:
region Targeted Ads Total Ads Relative %
0 Friesland 60 8621 0.695975
1 Drenthe 79 4203 1.879610
2 Zeeland 91 8621 1.055562
3 Flevoland 108 8621 1.252755
4 Groningen 123 8621 1.426749
5 Limburg 367 8621 4.257047
6 Utrecht 449 8621 5.208213
7 Overijssel 462 8621 5.359007
8 Zuid-Holland 487 8621 5.648997
9 Gelderland 510 8621 5.915787
In [27]:
px.bar(ads_one_region, x='Relative %', y='region', labels=None,
       orientation='h', color='Targeted Ads', color_continuous_scale='blues',title='Most Microtargeted Regions')

The reading is that 8 out of 100 Ads shown in Noord-Brabant aren't shown anywhere else. On the other hand almost all the ads shown in Friesland aren't intended to be be seen specifically for those regions.

In [28]:
# Grouping the regions and counting the ads over time
regions = ['Noord-Holland','Zuid-Holland','Gelderland','North Brabant',
           'Utrecht','Groningen','Overijssel','Flevoland','Limburg','Friesland','Zeeland']

df_regions_date = \
    df_regions[(df_regions['region'].isin(regions)) & (df_regions['percentage'] > 0.4)]\
    .merge(df_ads, on='ad_id', how='inner')\
    .pivot_table(values='ad_id',index='ad_delivery_start_time', columns='region',aggfunc='count',fill_value=None)
In [29]:
fig = px.line(df_regions_date.loc["2021-02-28":"2021-03-20",])
fig.add_vrect(x0="2021-03-15", x1="2021-03-17", row="all", col=1,
              annotation_text="Election Days", annotation_position="top left",
              fillcolor="green", opacity=0.25, line_width=0)
fig.show()

It is clear to see that during the election days the campaigns were finalized, and a small handful of them continued after the election for some reason. There is a clear peak on Friday (March 12) before the election, where the majority of the ads were displayed, also right before the election days.

Ads Topics¶


Performing N-grams analysis on the text of the ads. It is considered 1-grams to 4-grams terms using Dutch and English dictionaries, finally terms frequency and inverse terms frequency are compared.

In [30]:
import re
import nltk 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
In [31]:
# Taking into account only unique campaigns
ads_content = df_ads['ad_creative_body'].unique()
ads_content = [str(i).lower() for i in ads_content]
In [32]:
def clean_text_round(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub('\[.*?\]', ' ', text)
    text = re.sub('\w*\d\w*', ' ', text)
    text = re.sub('<.*?>', ' ', text)
    text = re.sub('\n', ' ', text) 
    text = re.sub('\t', ' ', text) 
    text = re.sub('\(b\)\(6\)', ' ', text) 
    text = re.sub('&quot', ' ', text) 
    text = re.sub('---', ' ', text) 
    return text
In [33]:
stop_words = set(stopwords.words(['english','dutch','turkish'])) 
your_list = ['stem','het','onze','we','mij','jou','nl','jouw','mee','wij', 'nl ',' ','jij','nan','per','word','nederland', 'kamer','partij','tweede','stemmen','den','gaat','https','daarom','cda','pvda','fvd','denk','oranje','vvd','groenlinks','sp','ga','www','code','verkiezingen','nummer'] 
for i, line in enumerate(ads_content): 
    ads_content[i] = ' '.join([str(x).lower() for 
        x in nltk.word_tokenize(line) if 
        ( x not in stop_words ) and ( x not in your_list )])
In [34]:
# Getting n-grams table
def ngrams_table(n, list_texts):
    vectorizer = CountVectorizer(ngram_range = (n,n)) 
    X1 = vectorizer.fit_transform(list_texts)  
    features = vectorizer.get_feature_names()
    # Applying TFIDF 
    vectorizer = TfidfVectorizer(ngram_range = (n,n)) 
    X2 = vectorizer.fit_transform(list_texts) 
    # Getting top ranking features 
    sums1 = X1.sum(axis = 0) 
    sums2 = X2.sum(axis = 0) 
    data = [] 
    for col, term in enumerate(features): 
        data.append( (term, sums1[0,col], sums2[0,col] )) 

    return pd.DataFrame(data, columns = ['term','rankCount', 'rankTFIDF']).sort_values('rankCount', ascending = False).reset_index(drop=True)
In [35]:
ads_content = [clean_text_round(text) for text in ads_content]
table_ngrams = pd.DataFrame()
for i in [1,2,3,4]:
    table_ngrams = table_ngrams.append(ngrams_table(i, ads_content))
In [36]:
# drop weird first term
table_ngrams_plot = table_ngrams.iloc[1:,:]\
    .sort_values(by='rankCount')\
    .reset_index(drop=True)\
    .rename(columns={'rankCount':'Term Frequency', 'rankTFIDF':'Inverse Term Frequency', 'term':'N-grams Keywords'})\
    .sort_values('Inverse Term Frequency',ascending=False)
In [37]:
px.bar(table_ngrams_plot.head(40).sort_values(by='Inverse Term Frequency', ascending=True), 
       x='N-grams Keywords', y='Inverse Term Frequency', labels=None,
       orientation='v', color='Term Frequency', color_continuous_scale='RdBu',title='Ads Terms Frecuency')

The n-grams analysis does not necessarily present key insights, mostly we can see generic terms, what would be interesting to analyse is the consistency of the advertisement and the actual campaign plans. Not only that, but this workflows can be adapted to be iterable so that we could see what topics are shown by demographic groups and regions.

Related work and Conslusion¶


There is also some interesting work conducting similar analysis, for example, Roberto Rocha from CBC News reports how 35,000 political ads on Facebook were analysed in Canadian elections. His main focus was to discuss how rules could affect advertising practices. Also, Ondrej Pekacek has created a very awesome monitor of ads in Czech elections. He aims to create an automated workflow, which would inform analysts covering the political communication and financing of Czech elections. Example dashboard

There are some other dozens of projects focused in the US, which are very interesting, but mostly focused on voter fraud conspiracies which is more related to scams in ads. Ultimately, we have discussed the limitations and piloted a data analysis of how the Facebook Ads Library can help on bringing transparency to advertisement practices and to democracy.

In this notebook, we have focused on the Dutch General Elections of 2021 finding not surprising insights, however, what is more, important is to open the discussion whether Facebook should be forced to open their Ads Library to access all types of ads and not only the "Social Issues, Elections or Politics" category, this is where its business model relies on, but also rapidly they're losing trust.


Date: Jun 2021
Author: Pedro V Hernández Serrano
License: Attribution 4.0 International (CC BY 4.0)


  • Last updated:
  • Contributors
    • Facebook Ads Data Analysis
    • Key Insights Summary
    • 1. Accessing the Facebook API as General user and Marketers
    • 2. Accessing the Facebook API as Developer
    • 3. Discovering Key Insights Analysing Ads Data

    Copyright© 2019 - 2021 ‣ Maastrich University

    Search

    From here you can search these documents. Enter your search terms below.

    Keyboard Shortcuts

    Keys Action
    ? Open this help
    n Next page
    p Previous page
    s Search